{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "cLKIel77CJPi"
   },
   "source": [
    "# Ungraded Lab: Subword Tokenization with the IMDB Reviews Dataset\n",
    "\n",
    "In this lab, you will look at tokenizing a dataset using subword text encoding. This is an alternative to word-based tokenization which you have been using in the previous labs. You will see how it works and its effect on preparing your data and training your model.\n",
    "\n",
    "Let's begin!\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "ULU4nGg2F7Bm"
   },
   "source": [
    "## Lab Setup\n",
    "\n",
    "First, you will install some additional packages in Colab and import the ones you will use in the next sections."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "KgvM3LABFnLm"
   },
   "outputs": [],
   "source": [
    "import tensorflow as tf\n",
    "import tensorflow_datasets as tfds\n",
    "import matplotlib.pyplot as plt\n",
    "import keras_nlp"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "qrzOn9quZ0Sv"
   },
   "source": [
    "## Load the IMDB Reviews dataset\n",
    "\n",
    "As you did in the first ungraded lab, you will load the [IMDB Reviews](https://www.tensorflow.org/datasets/catalog/imdb_reviews) dataset from Tensorflow Datasets."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "_IoM4VFxWpMR"
   },
   "outputs": [],
   "source": [
    "# Load the dataset\n",
    "imdb = tfds.load(\"imdb_reviews\", as_supervised=True, data_dir='./data', download=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "v3rwL6H3G9Cv"
   },
   "source": [
    "Then, extract the reviews and labels so you can preprocess them."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "zAYgHw6TyfpQ"
   },
   "outputs": [],
   "source": [
    "train_reviews = imdb['train'].map(lambda review, label: review)\n",
    "train_labels = imdb['train'].map(lambda review, label: label)\n",
    "\n",
    "test_reviews = imdb['test'].map(lambda review, label: review)\n",
    "test_labels = imdb['test'].map(lambda review, label: label)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "LPJXhkOKIl_f"
   },
   "source": [
    "You can preview a few reviews as a sanity check."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "LdUcjsr0ILO4"
   },
   "outputs": [],
   "source": [
    "# Show two reviews\n",
    "list(train_reviews.take(2))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "YKrbY2fjjFHM"
   },
   "source": [
    "## Subword Tokenization\n",
    "\n",
    "From previous labs, the number of tokens in the sequence is the same as the number of words in the text (i.e. word tokenization). The following cells shows a review of this process."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "QduauF7D1n3g"
   },
   "outputs": [],
   "source": [
    "# Parameters for tokenization and padding\n",
    "VOCAB_SIZE = 10000\n",
    "MAX_LENGTH = 120\n",
    "PADDING_TYPE = 'pre'\n",
    "TRUNC_TYPE = 'post'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "-N6Yd_TE3gZ5"
   },
   "outputs": [],
   "source": [
    "# Instantiate the vectorization layer\n",
    "vectorize_layer = tf.keras.layers.TextVectorization(\n",
    "    max_tokens=VOCAB_SIZE\n",
    ")\n",
    "\n",
    "# Generate the vocabulary based only on the training set\n",
    "vectorize_layer.adapt(train_reviews)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "KmuvzVS31OLA"
   },
   "outputs": [],
   "source": [
    "def padding_func(sequences):\n",
    "  '''Generates padded sequences from a tf.data.Dataset'''\n",
    "\n",
    "  # Put all elements in a single ragged batch\n",
    "  sequences = sequences.ragged_batch(batch_size=sequences.cardinality())\n",
    "\n",
    "  # Output a tensor from the single batch\n",
    "  sequences = sequences.get_single_element()\n",
    "\n",
    "  # Pad the sequences\n",
    "  padded_sequences = tf.keras.utils.pad_sequences(sequences.numpy(), \n",
    "                                                  maxlen=MAX_LENGTH, \n",
    "                                                  truncating=TRUNC_TYPE, \n",
    "                                                  padding=PADDING_TYPE\n",
    "                                                 )\n",
    "\n",
    "  # Convert back to a tf.data.Dataset\n",
    "  padded_sequences = tf.data.Dataset.from_tensor_slices(padded_sequences)\n",
    "\n",
    "  return padded_sequences"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "aknxBrRY1KTo"
   },
   "outputs": [],
   "source": [
    "# Apply the vectorization layer and padding on the training inputs\n",
    "train_sequences = train_reviews.map(lambda text: vectorize_layer(text)).apply(padding_func)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "nNUlDp76lf94"
   },
   "source": [
    "The cell above uses a `vocab_size` of 10000 but you'll find that it's easy to find OOV tokens when decoding using the lookup dictionary it created. See the result below and notice the `[UNK]` tags:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "YmsECyVr4OPE"
   },
   "outputs": [],
   "source": [
    "# Get the vocabulary\n",
    "imdb_vocab_fullword = vectorize_layer.get_vocabulary()\n",
    "\n",
    "# Get a sample integer sequence\n",
    "sample_sequence = train_sequences.take(1).get_single_element()\n",
    "\n",
    "# Lookup each token in the vocabulary\n",
    "decoded_text = [imdb_vocab_fullword[token] for token in sample_sequence]\n",
    "\n",
    "# Combine the words\n",
    "decoded_text = ' '.join(decoded_text)\n",
    "\n",
    "# Print the output\n",
    "print(decoded_text)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "O0HQqkBmpujb"
   },
   "source": [
    "For binary classifiers, this might not have a big impact but you may have other applications that will benefit from avoiding OOV tokens when training the model (e.g. text generation). If you want the tokenizer above to not have OOVs, then you might have to increase the vocabulary size to more than 88k. Right now, it's only at 10k. This can slow down training and bloat the model size. The encoder also won't be robust when used on other datasets which may contain new words, thus resulting in OOVs again."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "McxNKhHIsNvl"
   },
   "source": [
    "*Subword text encoding* gets around this problem by using parts of the word to compose whole words. This makes it more flexible when it encounters uncommon words. You can use the [KerasNLP](https://keras.io/api/keras_nlp/) API to do just that."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "gwByl7OCNGl3"
   },
   "source": [
    "First, you will compute the subword vocabulary using the [compute_word_piece_vocabulary()](https://keras.io/api/keras_nlp/tokenizers/compute_word_piece_vocabulary/#compute_word_piece_vocabulary-function) function. You will tell it to:\n",
    "* learn from the `train_reviews`\n",
    "* set a max vocabulary size of 8k\n",
    "* reserve special tokens similar to the full word vocabulary\n",
    "* save the output to a file in the current directory\n",
    "\n",
    "***Note: This will take around 5 minutes to run. If you want to save some time, you can skip it and download the subword vocabulary in the next cell.***"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "h-tvKmx2Lqxj"
   },
   "outputs": [],
   "source": [
    "# Compute the subword vocabulary and save to a file\n",
    "keras_nlp.tokenizers.compute_word_piece_vocabulary(\n",
    "    train_reviews,\n",
    "    vocabulary_size=8000,\n",
    "    reserved_tokens=[\"[PAD]\", \"[UNK]\"],\n",
    "    vocabulary_output_file='imdb_vocab_subwords.txt'\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "3aS030JMRyt6"
   },
   "source": [
    "Next, you will initialize a [WordPieceTokenizer](https://keras.io/api/keras_nlp/tokenizers/word_piece_tokenizer/#wordpiecetokenizer-class) using the vocabulary. This will behave similar to the `TextVectorization` layer you've been using so far, but it is able to generate subword sequences."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "hwJHhfTLXhsx"
   },
   "outputs": [],
   "source": [
    "# Uncomment this line if you skipped the cell above and want to use a pre-saved vocabulary\n",
    "# !wget -nc https://storage.googleapis.com/tensorflow-1-public/course3/imdb_vocab_subwords.txt"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "yd-WGsVOLvch"
   },
   "outputs": [],
   "source": [
    "# Initialize the subword tokenizer\n",
    "subword_tokenizer = keras_nlp.tokenizers.WordPieceTokenizer(\n",
    "    vocabulary='./imdb_vocab_subwords.txt'\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "yMNCxZ9xSgEy"
   },
   "source": [
    "See the vocabulary below. You'll notice that many of them are just parts of words, sometimes just single characters. Some also have a `##` which indicates that it is a suffix (i.e. something that is connected to a previous token). You'll see how this behaves later with an example."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "SqyMSZbnwFBo"
   },
   "outputs": [],
   "source": [
    "# Print the subwords\n",
    "subword_tokenizer.get_vocabulary()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "kaRA9LBUwfHM"
   },
   "source": [
    "If you use it on the previous plain text sentence, you'll see that it won't have any OOVs even if it has a smaller vocab size (only around 8k compared to 10k above):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "B8HSViuDGNco"
   },
   "outputs": [],
   "source": [
    "# Show the size of the subword vocabulary\n",
    "subword_tokenizer.vocabulary_size()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "tn_eLaS5mR7H"
   },
   "outputs": [],
   "source": [
    "# Get a sample review\n",
    "sample_review = train_reviews.take(1).get_single_element()\n",
    "\n",
    "# Encode the first plaintext sentence using the subword text encoder\n",
    "tokenized_string = subword_tokenizer.tokenize(sample_review)\n",
    "print ('Tokenized string is {}'.format(tokenized_string))\n",
    "\n",
    "# Decode the sequence\n",
    "original_string = subword_tokenizer.detokenize(tokenized_string)\n",
    "\n",
    "# Print the result\n",
    "print('The original string: {}'.format(original_string))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "iL9O3hEqw4Bl"
   },
   "source": [
    "Subword encoding can even perform well on words that are not commonly found in movie reviews. First, see the result when using the full-word tokenizer. As expected, it will show many unknown words."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "MHRj1J0j8ApE"
   },
   "outputs": [],
   "source": [
    "# Define sample sentence\n",
    "sample_string = 'TensorFlow, from basics to mastery'\n",
    "\n",
    "# Encode using the plain text tokenizer\n",
    "tokenized_string = vectorize_layer(sample_string)\n",
    "print ('Tokenized string is {}'.format(tokenized_string))\n",
    "\n",
    "# Decode and print the result\n",
    "decoded_text = [imdb_vocab_fullword[token] for token in tokenized_string]\n",
    "original_string = ' '.join(decoded_text)\n",
    "print ('The original string: {}'.format(original_string))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "ZhQ-4O-uxdbJ"
   },
   "source": [
    "Then compare to the subword tokenizer:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "fPl2BXhYEHRP"
   },
   "outputs": [],
   "source": [
    "# Encode using the subword text encoder\n",
    "tokenized_string = subword_tokenizer.tokenize(sample_string)\n",
    "print('Tokenized string is {}'.format(tokenized_string))\n",
    "\n",
    "# Decode and print the results\n",
    "original_string = subword_tokenizer.detokenize(tokenized_string).numpy().decode(\"utf-8\")\n",
    "print('The original string: {}'.format(original_string))\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "89sbfXjz0MSW"
   },
   "source": [
    "As you may notice, the sentence is correctly decoded. The downside is the token sequence is much longer. Instead of only 5 when using the full-word tokenizer, you ended up with 12 tokens instead. The mapping for this sentence is shown below:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "_3t7vvNLEZml"
   },
   "outputs": [],
   "source": [
    "# Show token to subword mapping:\n",
    "for ts in tokenized_string:\n",
    "  print ('{} ----> {}'.format(ts, subword_tokenizer.detokenize([ts]).numpy().decode(\"utf-8\")))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "aZ22ugch1TFy"
   },
   "source": [
    "## Training the model\n",
    "\n",
    "You will now train your model using the subword-tokenized dataset using the same process as before."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "LVSTLBe_SOUr"
   },
   "outputs": [],
   "source": [
    "SHUFFLE_BUFFER_SIZE = 10000\n",
    "PREFETCH_BUFFER_SIZE = tf.data.AUTOTUNE\n",
    "BATCH_SIZE = 32\n",
    "\n",
    "# Generate integer sequences using the subword tokenizer\n",
    "train_sequences_subword = train_reviews.map(lambda review: subword_tokenizer.tokenize(review)).apply(padding_func)\n",
    "test_sequences_subword = test_reviews.map(lambda review: subword_tokenizer.tokenize(review)).apply(padding_func)\n",
    "\n",
    "# Combine the integer sequence and labels\n",
    "train_dataset_vectorized = tf.data.Dataset.zip(train_sequences_subword,train_labels)\n",
    "test_dataset_vectorized = tf.data.Dataset.zip(test_sequences_subword,test_labels)\n",
    "\n",
    "# Optimize the datasets for training\n",
    "train_dataset_final = (train_dataset_vectorized\n",
    "                       .shuffle(SHUFFLE_BUFFER_SIZE)\n",
    "                       .cache()\n",
    "                       .prefetch(buffer_size=PREFETCH_BUFFER_SIZE)\n",
    "                       .batch(BATCH_SIZE)\n",
    "                       )\n",
    "\n",
    "test_dataset_final = (test_dataset_vectorized\n",
    "                      .cache()\n",
    "                      .prefetch(buffer_size=PREFETCH_BUFFER_SIZE)\n",
    "                      .batch(BATCH_SIZE)\n",
    "                      )"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "HCjHCG7s2sAR"
   },
   "source": [
    "Next, you will build the model. You can just use the architecture from the previous lab."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "5NEpdhb8AxID"
   },
   "outputs": [],
   "source": [
    "# Define dimensionality of the embedding\n",
    "EMBEDDING_DIM = 64\n",
    "\n",
    "# Build the model\n",
    "model = tf.keras.Sequential([\n",
    "    tf.keras.Input(shape=(MAX_LENGTH,)),\n",
    "    tf.keras.layers.Embedding(subword_tokenizer.vocabulary_size(), EMBEDDING_DIM),\n",
    "    tf.keras.layers.GlobalAveragePooling1D(),\n",
    "    tf.keras.layers.Dense(6, activation='relu'),\n",
    "    tf.keras.layers.Dense(1, activation='sigmoid')\n",
    "])\n",
    "\n",
    "# Print the model summary\n",
    "model.summary()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "2aOn2bAc3AUj"
   },
   "source": [
    "Similarly, you can use the same parameters for training. In Colab, it will take around 10 to 15 seconds per epoch (without an accelerator) and you will reach around 92% training accuracy and 77% validation accuracy."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "fkt8c5dNuUlT"
   },
   "outputs": [],
   "source": [
    "num_epochs = 10\n",
    "\n",
    "# Set the training parameters\n",
    "model.compile(loss='binary_crossentropy',optimizer='adam',metrics=['accuracy'])\n",
    "\n",
    "# Start training\n",
    "history = model.fit(train_dataset_final, epochs=num_epochs, validation_data=test_dataset_final)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "3ygYaD6H3qGX"
   },
   "source": [
    "## Visualize the results\n",
    "\n",
    "You can use the cell below to plot the training results. See if you can improve it by tweaking the parameters such as the size of the embedding and number of epochs."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "-_rMnm7WxQGT"
   },
   "outputs": [],
   "source": [
    "def plot_loss_acc(history):\n",
    "  '''Plots the training and validation loss and accuracy from a history object'''\n",
    "  acc = history.history['accuracy']\n",
    "  val_acc = history.history['val_accuracy']\n",
    "  loss = history.history['loss']\n",
    "  val_loss = history.history['val_loss']\n",
    "\n",
    "  epochs = range(len(acc))\n",
    "\n",
    "  fig, ax = plt.subplots(1,2, figsize=(12, 6))\n",
    "  ax[0].plot(epochs, acc, 'bo', label='Training accuracy')\n",
    "  ax[0].plot(epochs, val_acc, 'b', label='Validation accuracy')\n",
    "  ax[0].set_title('Training and validation accuracy')\n",
    "  ax[0].set_xlabel('epochs')\n",
    "  ax[0].set_ylabel('accuracy')\n",
    "  ax[0].legend()\n",
    "\n",
    "  ax[1].plot(epochs, loss, 'bo', label='Training Loss')\n",
    "  ax[1].plot(epochs, val_loss, 'b', label='Validation Loss')\n",
    "  ax[1].set_title('Training and validation loss')\n",
    "  ax[1].set_xlabel('epochs')\n",
    "  ax[1].set_ylabel('loss')\n",
    "  ax[1].legend()\n",
    "\n",
    "  plt.show()\n",
    "\n",
    "plot_loss_acc(history)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "R0TRE-Lb4C5b"
   },
   "source": [
    "## Wrap Up\n",
    "\n",
    "In this lab, you saw how subword tokenization can be a robust technique to avoid out-of-vocabulary tokens. It can decode uncommon words it hasn't seen before even with a relatively small vocab size. Consequently, it results in longer token sequences when compared to full word tokenization. Next week, you will look at other architectures that you can use when building your classifier. These will be recurrent neural networks and convolutional neural networks."
   ]
  }
 ],
 "metadata": {
  "colab": {
   "private_outputs": true,
   "provenance": []
  },
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
